On the surface, this assignment is meant to make sure you’re familiar with plotting spatial data. In practice, though, the bulk of your time will most likely go to wrangling and reshaping the data so that it’s ready to be graphed. As we move into the final stretch of the class, the hints will become sparser. As with all the previous homework assignments, there’s no need to look up fancy packages or techniques: everything can be done with the tools we already have unless stated otherwise.
The data are in the form in which they were originally collected (except that someone was nice enough to gather the lat/long coordinates of the zip codes for you).
The data come from a Dialect Survey conducted by Bert Vaux. Some limited information can be found at the original, now-deprecated website http://www4.uwm.edu/FLL/linguistics/dialect/index.html. Although 122 questions were asked in the survey, the subset of the data provided to you contains answers only to the 67 questions that focused on lexical rather than phonetic differences.
There are three files included in this assignment:
- question_data.Rdata, an Rdata file containing
  - quest.mat, a data frame containing the questions
  - all.ans, a list of data frames containing answers to the questions
- lingData.txt, a space-separated data table where each observation represents a response to the survey
  - ID: a unique ID for each participant
  - CITY: self-reported city of the participant
  - STATE: self-reported state of the participant
  - ZIP: self-reported zip code of the participant
  - lat/long: coordinates calculated from the center of each zip code
  - Q50-Q121: the participant’s response to a question. Some question numbers are missing in this range. A value of 0 indicates no response; other numbers directly match their corresponding letter, e.g. 1 matches a.
- lingLocation.txt, an aggregated data set. The responses from lingData.txt were turned into binary responses (e.g. “1 if the participant answered a on question 50 and 0 otherwise”), binned into 1-degree-latitude by 1-degree-longitude “squares”, and summed over individuals within each bin.
  - Cell: a unique ID for each lat/long bin
  - Latitude/Longitude: coordinates for the cell
  - V4-V471: the number of responses for the corresponding question and answer in the cell. V4 corresponds to answer a to question 50, while V468 corresponds to answer g to question 121 (the very last answer to the last question).

Note that while the rows represent the same data in lingData.txt and lingLocation.txt, they are different observational units. For example, say John and Paul take a questionnaire with two questions, where the first question has three answer choices and the second has four. If John answered A and D and Paul answered B and D, then lingData would encode two vectors, (1, 4) and (2, 4). If they lived in the same latitude/longitude box, lingLocation would encode them as one vector: (1, 1, 0, 0, 0, 0, 2).
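The John and Paul example can be checked directly in base R. All names below are purely illustrative, not columns from the real files:

```r
# Each lingData-style row records the chosen answer number per question
ling_data <- rbind(john = c(Q1 = 1, Q2 = 4),
                   paul = c(Q1 = 2, Q2 = 4))

# lingLocation-style encoding: expand each answer into binary indicators
# (3 choices for question 1, 4 for question 2), then sum within the bin
n_choices <- c(Q1 = 3, Q2 = 4)
to_binary <- function(row) {
  unlist(mapply(function(ans, k) tabulate(ans, nbins = k), row, n_choices))
}
bin_total <- colSums(t(apply(ling_data, 1, to_binary)))
unname(bin_total)
## 1 1 0 0 0 0 2
```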
You’ll need read_delim from the readr package to read in the last two files. Remember to specify the delim argument, which gives the character that separates fields in the text file.
Explore and clean the data. Document what was added/removed, explaining your actions.
library(readr)    # read_delim(), parse_number(), write_csv()
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)  # map_data()

lingData = read_delim("lingData.txt", delim = " ")
lingLocation = read_delim("lingLocation.txt", delim = " ")
load("~/Documents/Stat_133/stat133/hw7/question_data.RData")

# Stack the per-question answer tables (all.ans) into one data frame
ans_data = do.call(rbind, all.ans)
# Collapse tied answers into a single slash-separated label, e.g. "a/b"
fix_tie = function(x){
  str_c(x, collapse = "/")
}

# Named vectors double as lookup tables for str_replace_all():
# "1" -> "a", ..., "26" -> "z", and "AL" -> "Alabama", etc.
names(letters) = 1:26
names(state.name) = state.abb
# Polygon boundaries for the lower 48 states, with names matching state.name
states = map_data("state") %>%
  rename(STATE = region) %>%
  select(STATE, long, lat, group) %>%
  mutate(STATE = str_to_title(STATE))

# Reshape to long format: one row per participant-question pair
new_lingData = lingData %>%
  gather(question_no, observation, -ID, -CITY, -STATE, -ZIP, -lat, -long) %>%
  mutate(qnum = parse_number(question_no),
         ans.let = str_replace_all(as.character(observation), letters),
         STATE = str_replace_all(STATE, state.name)) %>%
  left_join(ans_data) %>%   # joins by qnum and ans.let, adding the answer text
  select(-lat, -long) %>%
  left_join(quest.use)      # quest.use (question text) is loaded from question_data.RData
# For each question and state, keep the most common answer (ties included)
modified_new_lingData = new_lingData %>%
  filter(STATE %in% state.name) %>%
  filter(STATE != "Hawaii") %>%
  filter(STATE != "Alaska") %>%
  group_by(qnum, STATE, ans) %>%
  tally() %>%                      # tally() drops the ans grouping level,
  filter(n == max(n)) %>%          # so max(n) is taken within question-state
  group_by(quest, STATE) %>%
  mutate(ans = fix_tie(ans)) %>%   # merge tied answers into one label
  unique()

# Attach the state polygons so the answers can be drawn on a map
final_lingData = inner_join(states, modified_new_lingData)
write_csv(final_lingData, path = "usa_survey.csv")
We gather all of the question columns of lingData into a single question_no column, then add a qnum column by parsing the question number out of each string. Next we create an ans.let column that converts the numeric response codes to letters, and we convert the abbreviated STATE values to full state names. We left join with ans_data, which maps each question number and answer letter to the answer text, drop the latitude and longitude columns, and left join again with quest.use to attach the question text. We then remove Hawaii and Alaska and tally the number of respondents choosing each answer within each state and question, keeping only the most common answer(s). The fix_tie function handles the special case where two or more answers tie for the most responses by collapsing them into a single slash-separated string. Finally, we inner join this cleaned data set with the states data set to attach the latitude and longitude polygons of the states.
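Two small ideas carry this pipeline: a named vector works as a lookup table for recoding, and tied answers are collapsed into one string so each state keeps a single row. Both can be seen in miniature in base R (the answer strings below are made up for illustration):

```r
# Named lookup: "1" -> "a", "2" -> "b", ..., "26" -> "z"
codes <- setNames(letters, 1:26)
codes[c("1", "3", "26")]
## "a" "c" "z"

# Collapsing tied answers into a single slash-separated label
tied <- c("you guys", "y'all")
paste(tied, collapse = "/")
## "you guys/y'all"
```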
Implement a Shiny app that colors a map of the continental US based on the most common answer for each state. The user should be allowed to pick one of the 67 questions from a dropdown menu. If two or more answers tie in a state, the map should show the tie as a distinct color. A static example with a tie in West Virginia is shown below:
As with homework 6, include your server and ui code below along with a link to your app on shinyapps.io.
library(shiny)
library(readr)  # read_csv()

usa_survey = read_csv("usa_survey.csv")

shinyUI(fluidPage(
  titlePanel("Most Common Answer to Survey Question for Each State"),
  sidebarLayout(
    sidebarPanel(
      selectInput("question",
                  label = "Choose a question:",
                  choices = na.omit(unique(usa_survey$quest)))
    ),
    mainPanel(plotOutput("usa"))
  )
))
library(shiny)
library(dplyr)
library(readr)
library(tidyr)
library(stringr)  # str_wrap()
library(ggplot2)
library(ggmap)
library(mapdata)

usa_survey = read_csv("usa_survey.csv")

shinyServer(function(input, output) {
  output$usa = renderPlot({
    # Keep only the rows for the selected question
    selected = usa_survey %>%
      filter(quest %in% input$question)
    ggplot(selected) +
      geom_polygon(aes(x = long, y = lat, fill = ans, group = group),
                   color = "black") +
      coord_fixed(1.3) +
      labs(title = str_wrap(input$question), x = "", y = "") +
      theme_void() +  # removes axes, ticks, and grid in one step
      scale_fill_discrete(name = "Answer",
                          labels = str_wrap(levels(factor(selected$ans)),
                                            width = 20))
  })
})
Make visualization(s) of the lingLocation data for two questions that you found interesting. Remember that each row represents a 1x1 square centered at the given lat/long coordinate.
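Since each row is a 1-degree-by-1-degree square centered at its coordinates, the bin edges sit 0.5 degrees from the center. A quick sketch with a made-up center:

```r
# Corners of a 1x1 degree bin centered at an illustrative point
center <- c(lat = 37.5, long = -95.5)
corners <- expand.grid(lat = center["lat"] + c(-0.5, 0.5),
                       long = center["long"] + c(-0.5, 0.5))
corners
##    lat  long
## 1 37.0 -96.0
## 2 38.0 -96.0
## 3 37.0 -95.0
## 4 38.0 -95.0
```

This is handy if you choose to draw the bins as squares (e.g. with geom_tile(width = 1, height = 1)) rather than as points.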
lingData = read_delim("lingData.txt", delim = " ")
lingLocation = read_delim("lingLocation.txt", delim = " ")
load("~/Documents/Stat_133/stat133/hw7/question_data.RData")
ans_data = do.call(rbind, all.ans)
# Lookup tables mapping "50a", "50b", ... to the answer text for questions 50 and 51
ans_data50 = ans_data %>%
  filter(qnum == 50) %>%
  unite(num.let, qnum, ans.let, sep = "")
ans_data51 = ans_data %>%
  filter(qnum == 51) %>%
  unite(num.let, qnum, ans.let, sep = "")
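Relabeling V columns by hand is error-prone. Given the number of answer choices per question, the "50a"-style labels can be generated programmatically; the counts below match the rename() calls used for data50 and data51 (question 50 has 9 choices, question 51 has 3):

```r
# Answer counts per question, in V-column order
n_ans <- c("50" = 9, "51" = 3)
v_labels <- unlist(mapply(function(q, k) paste0(q, letters[1:k]),
                          names(n_ans), n_ans))
unname(v_labels)
## "50a" "50b" ... "50i" "51a" "51b" "51c"
```

With the full choice counts taken from all.ans, these labels line up with V4 onward and could relabel all the V columns in one step via setNames().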
# Question 50 occupies V4-V12; relabel those columns, drop the rest,
# and keep the most common answer in each cell
data50 = lingLocation %>%
  rename("50a" = V4, "50b" = V5, "50c" = V6,
         "50d" = V7, "50e" = V8, "50f" = V9,
         "50g" = V10, "50h" = V11, "50i" = V12) %>%
  select(-c(V13:V471)) %>%
  gather(num.let, obs,
         -Cell, -`Number of people in cell`, -Latitude, -Longitude) %>%
  left_join(ans_data50) %>%
  group_by(Cell) %>%
  filter(obs == max(obs))
## Joining by: "num.let"
# Question 51 occupies V13-V15
data51 = lingLocation %>%
  rename("51a" = V13, "51b" = V14, "51c" = V15) %>%
  select(-c(V4:V12)) %>%
  select(-c(V16:V471)) %>%
  gather(num.let, obs,
         -Cell, -`Number of people in cell`, -Latitude, -Longitude) %>%
  left_join(ans_data51) %>%
  group_by(Cell) %>%
  filter(obs == max(obs))
## Joining by: "num.let"
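One subtlety: filter(obs == max(obs)) keeps every answer that attains the maximum, so a cell where two answers tie contributes two rows. The base R analogue, with hypothetical counts for one cell:

```r
# Counts for one cell: answers b and c tie for the maximum
obs <- c(a = 4, b = 7, c = 7)
names(obs)[obs == max(obs)]
## "b" "c"
```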
usa_data = map_data("usa")
usa_bbox = make_bbox(lat = lat, lon = long, data = usa_data)
usa_map = get_map(location = usa_bbox, source = "google", maptype = "hybrid")
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.25658,-95.844379&zoom=4&size=640x640&scale=2&maptype=hybrid&language=en-EN&sensor=false
ggmap(usa_map) +
geom_point(data = data50,
mapping = aes(x = Longitude, y = Latitude, color = ans),
size = 0.5) +
labs(title = "What word(s) do you use to address a group of two or more people?")
## Warning: Removed 40 rows containing missing values (geom_point).
ggmap(usa_map) +
geom_point(data = data51,
mapping = aes(x = Longitude, y = Latitude, color = ans),
size = 0.5) +
labs(title = "Would you say 'Are you coming with?' as a full sentence to mean 'Are you coming with us?'")
## Warning: Removed 38 rows containing missing values (geom_point).